Overview of data
In the previous lesson we applied linear regression to make quantitative predictions. In this lesson, we will learn how a related technique, logistic regression, can be used to make class or category predictions. In its most basic form, this type of prediction is binary, meaning it has only two options: yes (1) or no (0), disease or no disease, etc. Using the same core data set from the previous lesson, we will attempt to classify children with chronic kidney disease by CKD stage as stage 2 vs stage 3b. The iGFRc column has been removed for this lesson, since CKD stage is determined directly from it and including it would make the prediction trivial.
Our data is provided in two files. One has values for the outcome (Stage) for each subject ID and the other includes values for several predictors (e.g., creatinine, BUN, various endogenous metabolites) measured for each subject ID.
We will need to use our previously learned skills to read in the data and join the two sets by subject.
#load required packages
library(tidyverse) #read_csv, dplyr verbs, ggplot2
library(broom)     #glance, tidy, augment
library(car)       #vif
library(pROC)      #roc, auc, ggroc
library(tidylog)   #prints the join/mutate summaries shown below
#load in CKD_data.csv and CKD_stage.csv
data <- read_csv("data/CKD_data.csv")
## Parsed with column specification:
## cols(
## id = col_double(),
## SCr = col_double(),
## BUN = col_double(),
## CYC_DB = col_double(),
## Albumin = col_double(),
## uPCratio = col_double(),
## ADMA = col_double(),
## SDMA = col_double(),
## Creatinine = col_double(),
## Kynurenine = col_double(),
## Trp = col_double(),
## Phe = col_double(),
## Tyr = col_double()
## )
glimpse(data)
## Observations: 200
## Variables: 13
## $ id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16…
## $ SCr <dbl> 1.93, 2.72, 0.88, 0.78, 1.28, 1.41, 0.58, 1.34, 3.42,…
## $ BUN <dbl> 21, 37, 12, 27, 27, 19, 17, 13, 45, 39, 80, 52, 41, 2…
## $ CYC_DB <dbl> 0.82, 1.90, 0.65, 1.31, 1.50, 1.60, 1.04, 1.04, 3.04,…
## $ Albumin <dbl> 4.1, 4.2, 4.2, 4.4, 3.6, 4.2, 4.5, 3.8, 4.1, 4.0, 4.2…
## $ uPCratio <dbl> 1.39, 0.38, 0.33, 0.26, 0.57, 0.04, 0.12, 5.17, 0.26,…
## $ ADMA <dbl> 0.41, 0.69, 0.57, 0.39, 0.87, 0.48, 0.52, 1.14, 0.46,…
## $ SDMA <dbl> 0.70, 1.45, 0.49, 0.33, 2.31, 0.74, 0.46, 1.42, 0.96,…
## $ Creatinine <dbl> 138.61, 274.34, 78.81, 63.05, 226.21, 150.50, 70.84, …
## $ Kynurenine <dbl> 3.98, 7.35, 1.76, 2.41, 5.91, 4.29, 3.19, 6.18, 8.08,…
## $ Trp <dbl> 78.18, 45.02, 58.62, 34.64, 62.36, 52.21, 49.28, 101.…
## $ Phe <dbl> 79.45, 110.28, 74.47, 39.81, 75.40, 58.66, 53.82, 93.…
## $ Tyr <dbl> 77.00, 79.61, 65.22, 44.55, 94.24, 60.66, 68.07, 110.…
stage <- read_csv("data/CKD_stage.csv")
## Parsed with column specification:
## cols(
## id = col_double(),
## Stage = col_character()
## )
glimpse(stage)
## Observations: 200
## Variables: 2
## $ id <dbl> 498, 128, 13, 2, 183, 197, 78, 174, 91, 123, 168, 41, 130,…
## $ Stage <chr> "CKD3b", "CKD3b", "CKD3b", "CKD3b", "CKD3b", "CKD3b", "CKD…
#join by ID, convert ID and Stage variables to factors
ckd <- left_join(stage, data, by = "id") %>%
mutate(id = factor(id),
Stage = factor(Stage))
## left_join: added 0 rows and added 12 columns (SCr, BUN, CYC_DB, Albumin, uPCratio, …)
## mutate: converted 'id' from double to factor (0 new NA)
## mutate: converted 'Stage' from character to factor (0 new NA)
glimpse(ckd)
## Observations: 200
## Variables: 14
## $ id <fct> 498, 128, 13, 2, 183, 197, 78, 174, 91, 123, 168, 41,…
## $ Stage <fct> CKD3b, CKD3b, CKD3b, CKD3b, CKD3b, CKD3b, CKD3b, CKD3…
## $ SCr <dbl> 2.80, 3.14, 4.09, 2.72, 4.49, 3.86, 2.65, 3.06, 3.23,…
## $ BUN <dbl> 44, 45, 41, 37, 53, 44, 37, 52, 34, 47, 51, 47, 38, 6…
## $ CYC_DB <dbl> 2.63, 2.50, 2.97, 1.90, 4.94, 2.93, 2.25, 3.15, 3.04,…
## $ Albumin <dbl> 3.4, 4.2, 3.1, 4.2, 3.4, 4.1, 3.9, 4.5, 3.2, 3.8, 4.8…
## $ uPCratio <dbl> 6.79, 2.01, 1.50, 0.38, 1.11, 0.29, 7.52, 0.04, 14.23…
## $ ADMA <dbl> 0.99, 0.66, 0.77, 0.69, 0.93, 0.85, 0.66, 0.82, 0.75,…
## $ SDMA <dbl> 2.33, 1.68, 1.96, 1.45, 2.22, 2.45, 1.19, 1.77, 1.43,…
## $ Creatinine <dbl> 308.30, 293.63, 521.61, 274.34, 519.03, 512.06, 290.7…
## $ Kynurenine <dbl> 5.57, 8.72, 6.55, 7.35, 4.51, 7.47, 11.15, 9.95, 4.77…
## $ Trp <dbl> 36.89, 42.22, 45.49, 45.02, 47.73, 49.58, 50.33, 79.2…
## $ Phe <dbl> 73.72, 74.61, 116.18, 110.28, 98.99, 82.66, 85.27, 12…
## $ Tyr <dbl> 45.39, 51.82, 66.85, 79.61, 91.04, 68.90, 41.97, 94.2…
#how many subjects do we have? how many variables? how many subjects in each class?
Logistic regression
We can think about the probability or likelihood of a binary outcome as being between 0 and 1. Because the outcome values are limited to the range 0 to 1, we do not apply standard linear regression. If we tried, the fit could be problematic and could even produce impossible values (i.e., values < 0 or > 1). We need a model that restricts predictions to the range 0 to 1; logistic regression is one such model.
Instead of selecting coefficients that minimize the squared error terms from the best-fit line, as we did in linear regression, the coefficients in logistic regression are selected to maximize the likelihood of the observed data: predicting a high probability for observations that actually belong to class 1 and a low probability for observations that actually belong to class 0.
Assumptions of logistic regression:
- The outcome is a binary or dichotomous variable (yes vs no, positive vs negative, 1 vs 0).
- There is a linear relationship between the logit of the outcome and each predictor variable. The logit function is logit(p) = log(p/(1-p)), where p is the probability of the outcome.
- There are no influential values (extreme values or outliers) in the continuous predictors.
- There are no high intercorrelations (i.e., multicollinearity) among the predictors.
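The logit and its inverse can be sketched numerically; base R provides both as qlogis() (logit) and plogis() (the logistic function). This is just an illustration of the link function, not part of the lesson's data analysis.

```r
# logit maps probabilities in (0, 1) onto the whole real line;
# its inverse (the logistic function) maps any real value back into (0, 1)
p <- c(0.1, 0.5, 0.9)
log_odds <- qlogis(p)                             # same as log(p / (1 - p))
stopifnot(all.equal(log_odds, log(p / (1 - p))))
# the inverse recovers the original probabilities, so fitted log odds
# can always be converted back to values between 0 and 1
stopifnot(all.equal(plogis(log_odds), p))
```

This back-and-forth is why a model fit on the log-odds scale can still yield valid probabilities.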
Similar to the previous lesson, we will split the data, fit a model and then examine the model output on train and test data. In this case, we will use the glm function, which is commonly used for fitting Generalized Linear Models, of which logistic regression is one form. We specify that we want to use logistic regression using the argument family = "binomial". This returns an object of class “glm”, which inherits from the class “lm”. Therefore, it also includes attributes we can explore to learn about our model and its fit of our data.
A major difference is that logistic regression does not return the observation's class directly; it returns an estimated probability of class membership. The probability ranges from 0 to 1, and assignment to a class is based on a threshold. The default threshold is 0.5, but it can be adjusted to suit the purpose of the prediction. Simple and multivariable versions of logistic regression are possible. Since we explored that difference with linear regression, we will start this lesson with the multivariable model we ended with in the previous lesson.
Split the data
The data was provided to you after processing and cleaning, so we are able to skip these critical steps for this lesson. We start our modeling process by splitting our data into 75:25 train:test sets.
set.seed(439) #so we all get same random numbers
train <- sample(nrow(ckd), nrow(ckd) * 0.75)
test <- -train
ckd_train <- ckd[train, ] %>%
select(-id)
## select: dropped one variable (id)
ckd_test <- ckd[test, ] %>%
select(-id)
## select: dropped one variable (id)
Fit the model
We will fit a new model, modGLM, that uses SCr, BUN, and Kynurenine to predict Stage in the training set. As before, we will add the predicted probability values to the training set as a new variable, Stage_prob. The function contrasts shows what R is considering as the reference state for the prediction.
contrasts(ckd$Stage) #what is R considering the reference? CKD3b: 0 = N, 1 = Y
## CKD3b
## CKD2 0
## CKD3b 1
modGLM <- glm(Stage ~ SCr + BUN + Kynurenine, data = ckd_train, family = "binomial")
# what is in .fitted? Log odds.
head(augment(modGLM))
## # A tibble: 6 x 11
## Stage SCr BUN Kynurenine .fitted .se.fit .resid .hat .sigma
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 CKD2 1.46 13 3.57 -4.89 0.986 -0.122 0.00718 0.636
## 2 CKD2 0.89 17 5.3 -3.20 0.674 -0.282 0.0170 0.636
## 3 CKD2 0.93 12 4.06 -5.10 0.924 -0.110 0.00517 0.636
## 4 CKD2 1.59 27 3.95 -0.818 0.444 -0.855 0.0419 0.632
## 5 CKD3b 1.83 60 1.8 7.52 1.80 0.0330 0.00176 0.636
## 6 CKD2 1.16 20 4.09 -2.81 0.568 -0.343 0.0173 0.636
## # … with 2 more variables: .cooksd <dbl>, .std.resid <dbl>
# If we want probabilities for comparison, then we need to predict the train using type = "response"
#add the predicted values to the train set and set the type argument to response
ckd_train <- ckd_train %>%
mutate(Stage_prob = predict(modGLM, ckd_train, type = "response"))
## mutate: new variable 'Stage_prob' with 150 unique values and 0% NA
#using threshold 0.5, convert probabilities to predicted stage
ckd_train$Stage_pred <- ifelse(ckd_train$Stage_prob > 0.5, "CKD3b", "CKD2")
How did the model do at predicting stage in our training data? We can calculate the accuracy of the model and plot the density of the predicted probabilities by class.
#calculate accuracy, if == statement is TRUE, value = 1, otherwise = 0
mean(ckd_train$Stage_pred == ckd_train$Stage)
## [1] 0.9
ggplot(ckd_train, aes(Stage_prob, color = Stage)) +
geom_density()

Examining our model
Recalling the helpful functions we used from the broom package, we can examine our model. We see that the parameters reported for a logistic regression model differ from those we saw in the previous lesson on linear regression. R-squared is not relevant for logistic regression. Instead, to compare models, we rely on criteria called AIC and BIC: the Akaike Information Criterion and the Bayesian Information Criterion. Each tries to balance model fit and parsimony, and each penalizes the number of parameters differently. Models with the lowest AIC and lowest BIC are preferred.
Exercise 1:
Examine modGLM using the glance() and tidy() functions of the broom package. What is the AIC and BIC for this model? What are the coefficients for each term of the model?
glance(modGLM)
## # A tibble: 1 x 7
## null.deviance df.null logLik AIC BIC deviance df.residual
## <dbl> <int> <dbl> <dbl> <dbl> <dbl> <int>
## 1 208. 149 -29.4 66.7 78.8 58.7 146
#AIC 67
#BIC 79
# The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept). The wider this gap, the better.
# logLik is the log-likelihood of the fitted model; AIC and BIC are computed from it.
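The null-vs-residual deviance comparison can be formalized as a likelihood-ratio (chi-squared) test. Here is a sketch using the built-in mtcars data as a stand-in, so it runs on its own; the same pattern applies directly to modGLM.

```r
# likelihood-ratio test of a logistic model against its null model,
# illustrated on the built-in mtcars data (stand-in for modGLM)
fit <- glm(am ~ mpg, data = mtcars, family = "binomial")
dev_drop <- fit$null.deviance - fit$deviance  # deviance removed by the predictors
df_drop  <- fit$df.null - fit$df.residual     # number of parameters added (here 1)
# under the null, the drop follows a chi-squared distribution with df_drop df
p_value  <- pchisq(dev_drop, df = df_drop, lower.tail = FALSE)
p_value < 0.05  # a small p-value means the model beats the intercept-only model
```

For modGLM, dev_drop would be null.deviance minus deviance from the glance() output, with df_drop = 3 (one per predictor).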
tidy(modGLM) %>%
arrange(p.value)
## # A tibble: 4 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) -10.3 1.88 -5.48 0.0000000423
## 2 BUN 0.278 0.0559 4.97 0.000000662
## 3 Kynurenine 0.413 0.200 2.07 0.0387
## 4 SCr 0.238 0.731 0.326 0.744
# The logistic regression coefficients (estimate) give the change in the log odds of the outcome for a one-unit increase in the predictor variable. Exponentiating a coefficient (exp) converts it to an odds ratio.
exp(coef(modGLM))
## (Intercept) SCr BUN Kynurenine
## 3.268375e-05 1.269173e+00 1.320145e+00 1.511967e+00
# Now we can say that for a one unit increase in SCr, the odds of being in group CKD3b (vs in CKD2) increase by a factor of 1.27
End exercise
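Odds ratios are usually reported with confidence intervals. A sketch, again illustrated on a built-in dataset so it runs standalone (swap in modGLM for the lesson data); confint.default() gives Wald intervals computed from the standard errors.

```r
# odds ratios with 95% Wald confidence intervals for a logistic model,
# illustrated on the built-in mtcars data (same pattern applies to modGLM)
fit <- glm(am ~ mpg, data = mtcars, family = "binomial")
# coefficients are on the log-odds scale; exponentiating gives odds ratios
or_table <- exp(cbind(OR = coef(fit), confint.default(fit)))
round(or_table, 2)
# an OR interval that excludes 1 indicates a significant predictor
```

For modGLM, the SCr interval would straddle 1, matching its large p-value in the tidy() output.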
Examining collinearity
As mentioned before, we need to be careful when several predictors have strong correlation. Remember that we can calculate the variance inflation factor (VIF) for each model to determine how much the variance of a regression coefficient is inflated due to multicollinearity in the model. We want VIF values close to 1 (meaning no multicollinearity) and less than 5.
vif(modGLM)
## SCr BUN Kynurenine
## 1.142340 1.146886 1.007098
There does not seem to be a collinearity problem in our model.
Making predictions from our model
When we use the predict function on this model, it will predict the log odds of the Y variable. This is not what we ultimately want, since we want to determine the predicted Stage. To convert the predictions into probability scores that are bounded between 0 and 1, we specify type = "response".
#predict on test
table(ckd_test$Stage) #CKD3b ~ 50%
##
## CKD2 CKD3b
## 25 25
ckd_test$Stage_prob <- predict(modGLM, ckd_test, type = "response")
With the predicted probabilities, we can now apply a threshold and assign each row to either the CKD3b or CKD2 class. We will start with a threshold of 0.5. We know the actual assignment from the Stage column of this test data, so we can calculate the accuracy of our model's class predictions.
ckd_test$Stage_pred <- ifelse(ckd_test$Stage_prob > 0.5, "CKD3b", "CKD2")
mean(ckd_test$Stage_pred == ckd_test$Stage) #0.98
## [1] 0.98
Exercise 2:
Select a different threshold and determine the accuracy of the model for that threshold setting.
ckd_test$Stage_pred <- ifelse(ckd_test$Stage_prob > 0.6, "CKD3b", "CKD2")
mean(ckd_test$Stage_pred == ckd_test$Stage) #0.96
## [1] 0.96
End exercise
Build an ROC curve as an alternative to accuracy
Sometimes calculating the accuracy is not enough to determine model performance (especially when there is class imbalance, which can make accuracy misleading), and a threshold of 0.5 may not be optimal. We can use pROC package functions to build an ROC curve, find the area under the curve (AUC), and view the effect of changing the cutoff value on model performance.
# Convert Stage to a numeric indicator variable (0 or 1)
ckd_test <- ckd_test %>%
mutate(Stage_num = ifelse(Stage == "CKD3b", 1, 0))
## mutate: new variable 'Stage_num' with 2 unique values and 0% NA
# Create a ROC curve object from columns of actual and predicted probabilities
ROC <- roc(ckd_test$Stage_num, ckd_test$Stage_prob)
# Plot the ROC curve object
ggroc(ROC, alpha = 0.5, colour = "blue")

# Calculate the area under the curve (AUC)
auc(ROC)
## Area under the curve: 0.9952
As expected, we were able to build a strong classifier. Most real-world situations have less class separation than we found in this lesson. In those cases, one must consider the purpose of the classifier and weigh the importance of false positives versus false negatives. The ROC curve helps find the optimal cutoff in those cases. Calculating a confusion matrix to determine the sensitivity and specificity of the model would also be warranted.
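The confusion-matrix calculation mentioned above can be sketched in base R with table(). The example below uses the built-in mtcars data as a stand-in for ckd_test, so it runs on its own; the same three lines work with Stage_pred and Stage.

```r
# confusion matrix, sensitivity, and specificity for a binary classifier,
# illustrated on the built-in mtcars data (stand-in for the CKD test set)
fit  <- glm(am ~ mpg, data = mtcars, family = "binomial")
prob <- predict(fit, type = "response")
pred <- ifelse(prob > 0.5, 1, 0)          # threshold at 0.5, as in the lesson
cm   <- table(predicted = pred, actual = mtcars$am)
cm
# sensitivity: of the actual positives, how many did we call positive?
sensitivity <- cm["1", "1"] / sum(cm[, "1"])
# specificity: of the actual negatives, how many did we call negative?
specificity <- cm["0", "0"] / sum(cm[, "0"])
c(sensitivity = sensitivity, specificity = specificity)
```

Moving the threshold trades one off against the other, which is exactly what the ROC curve visualizes.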